Executive Summary


This research aims to develop new and more accurate stochastic models for speaker-independent
continuous speech recognition by extending previous work in segment-based modeling and by
introducing a new hierarchical approach to representing intra-utterance statistical dependencies.
These techniques, which have high computational costs because of the large search space associated
with higher order models, are made feasible through rescoring a set of HMM-generated N-best
sentence hypotheses. We expect these different modeling techniques to result in improved
recognition performance over that achieved by current systems, which handle only frame-based
observations and assume that these observations are independent given an underlying state sequence.

In the past year, the accomplishments of this project, funded in part by a related ARPA-NSF
grant (NSF no. IRI-8902124), include:

• improved the N-best rescoring paradigm by introducing score normalization and more robust
weight estimation techniques;

• investigated techniques for improving the baseline stochastic segment model (SSM) system,
including context clustering for robust parameter estimation, tied mixture distributions at the
frame and segment level, a two level segment/microsegment formalism, multiple pronunciation
word models, and automatic distribution mapping estimation;

• extended the classification and segmentation scoring formalism to context-dependent
modeling without assuming independence of observations in different segments, which opens the
possibility for a broader class of features for recognition;

• demonstrated results comparable to the best HMM systems on the Resource Management,
Switchboard and Wall Street Journal tasks;

• developed an initial dependency tree model of intra-utterance observation correlation; and

• implemented and evaluated baseline n-gram language models, and developed new language
models to handle topic-related language dynamics and variations in verbalized numbers and
punctuation.

We currently report baseline SSM results on the Wall Street Journal task that represent
improved performance over all results reported in November 1992. For the 5k vocabulary,
non-verbalized punctuation test set and the bigram language model, we achieve 8.1% error with the
SSM and 7.3% error with the combined HMM-SSM system, which can be compared to reported
rates of 8.7% - 15% for comparable HMM systems. In addition, we see much room for further
improvement, as these models still rely on an assumption of conditional independence and do not
take full advantage of the segment formalism.

Principal Investigator Name: Mari Ostendorf
PI Institution: Boston University
PI Phone Number: 617-353-5130
PI E-mail Address: mo@raven.bu.edu

Grant or Contract Title: Segment-Based Acoustic Models for Continuous Speech Recognition
Grant or Contract Number: ONR-N00014-92-J-1778
Reporting Period: 1 Oct 1992 - 30 September 1993

1 Productivity Measures

• Refereed papers submitted but not yet published: 1

• Refereed papers published: 1

• Unrefereed reports and articles: 3

• Books or parts thereof submitted but not yet published: 0

• Books or parts thereof published: 0

• Patents filed but not yet granted: 0

• Patents granted (include software copyrights): 0

• Invited presentations: 0

• Contributed presentations: 1 talk, 1 poster

• Honors received:

Prof. M. Ostendorf: Served on the IEEE Signal Processing Society Speech Technical
Committee; Chosen to chair the 1996 ARPA Workshop on Human Language Technology; Invited
to participate in the DoD workshop, Frontiers in Speech Processing - Robust Speech
Recognition. Dr. J. R. Rohlicek: Chosen to serve as an Associate Editor for IEEE Signal Processing
Letters.

• Prizes or awards received: 0

• Promotions obtained: At Boston University, Prof. Ostendorf was granted tenure and
promoted from Assistant Professor to Associate Professor. At BBN, Dr. Rohlicek was promoted
to Division Scientist.

• Graduate students supported > 25% of full time: 2-4

• Post-docs supported > 25% of full time: 0

• Minorities supported: 1 woman

2 Summary of Technical Progress

In this work, we are interested in the problem of large vocabulary, speaker-independent
continuous speech recognition, and primarily in the acoustic modeling component of this problem. In
developing acoustic models for speech recognition, we have conflicting goals. On one hand, the
models should be robust to inter- and intra-speaker variability, to the use of a different vocabulary
in recognition than in training, and to the effects of moderately noisy environments. In order to
accomplish this, we need to model gross features and global trends. On the other hand, the models
must be sensitive and detailed enough to detect fine acoustic differences between similar words in
a large vocabulary task. To answer those opposing demands requires improvements in acoustic
modeling at several levels: the frame level (e.g. signal processing), the phoneme level (e.g.
modeling feature dynamics), and the utterance level (e.g. defining a structural context for representing
the intra-utterance dependence across phonemes). This project addresses the problem of acoustic
modeling, specifically focusing on modeling at the segment level and above. The research
strategy includes three main thrusts. First, phone-level acoustic modeling is based on the stochastic
segment model (SSM), and in this area our main efforts involve developing new techniques for
robust context modeling, mechanisms for effectively incorporating segmental features, and models
of within-segment dependence of frame-based features. Second, high-level models are being
explored in order to capture speaker-dependent and session-dependent effects within the context of
a speaker-independent model. In particular, we are investigating hierarchical structures for
representing the intra-utterance dependency of phonetic models, and more recently language models
for representing topic dependency and language dynamics, recognizing that higher-order models
of correlation can extend to the language domain as well as the acoustic domain. Lastly, speech
recognition is implemented under the N-best rescoring paradigm, in which the BBN Byblos
system is used to constrain the stochastic segment model (SSM) search space by providing the top
N sentence hypotheses. This paradigm facilitates research on high-order models through reducing
development costs, and provides a modular framework for technology transfer that has already
enabled us to advance state-of-the-art recognition performance through collaboration with BBN.

In the first year of this project, we have focused on improving the performance of the basic
segment word recognition system and porting the system to the Wall Street Journal task domain.
The different accomplishments and advances, some of which were supported in part by an ARPA-NSF
grant (NSF no. IRI-8902124), are detailed below.

N-Best rescoring. We developed a grid-based search to avoid local optima in the weight
optimization criterion, together with methods for choosing among different local optima to obtain
more robust results. We also found that normalization of scores by observation length (e.g., frame,
phoneme, or word count, depending on the score) prior to the linear combination allows us to
obtain more robust weights and leads to a small reduction in error rate.
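
The rescoring step described above can be illustrated with a minimal sketch. It assumes each hypothesis carries one log score per knowledge source together with the matching observation count, and that weights are chosen to minimize word errors on a development set. The data layout, the names `DEV_SET`, `combined_score`, `rescore` and `grid_search`, and all numbers are hypothetical and are not taken from the project software. The first weight is fixed at 1.0 on the assumption that it is positive, since a common positive scale factor does not change the ranking.

```python
from itertools import product

# Toy development set: one N-best list per utterance. All values are invented.
# "scores" holds a log score per knowledge source, "lengths" the matching
# observation count (frames here), "errors" the word errors of the hypothesis.
DEV_SET = [
    [
        {"scores": {"hmm": -100.0, "ssm": -120.0},
         "lengths": {"hmm": 50, "ssm": 50}, "errors": 0},
        {"scores": {"hmm": -95.0, "ssm": -140.0},
         "lengths": {"hmm": 50, "ssm": 50}, "errors": 2},
    ],
]


def combined_score(hyp, weights):
    """Linear combination of scores, each normalized by its observation length."""
    return sum(w * hyp["scores"][name] / hyp["lengths"][name]
               for name, w in weights.items())


def rescore(nbest, weights):
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda hyp: combined_score(hyp, weights))


def grid_search(dev_set, names, grid):
    """Try every weight vector on the grid and keep the one with fewest errors.

    The first weight is fixed at 1.0; only the remaining weights are searched.
    """
    first, rest = names[0], names[1:]
    best_weights, best_errors = None, float("inf")
    for values in product(grid, repeat=len(rest)):
        weights = {first: 1.0, **dict(zip(rest, values))}
        errors = sum(rescore(nbest, weights)["errors"] for nbest in dev_set)
        if errors < best_errors:
            best_weights, best_errors = weights, errors
    return best_weights, best_errors


if __name__ == "__main__":
    print(grid_search(DEV_SET, ["hmm", "ssm"], [0.0, 0.5, 1.0]))
```

An exhaustive grid evaluates the error count everywhere rather than following a gradient, which is why it is not trapped by local optima. This sketch does not attempt the selection among several optima mentioned above.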

Improvements to the SSM. We focused on improving the performance of the basic
segment word recognition system. In brief, the accomplishments of that period include the following:
1) development of a method for clustering contexts to provide robust context-dependent model
parameter estimates using a likelihood ratio test to obtain ML estimates of tied covariances,
obtaining a factor of 10 reduction in memory costs with no loss in performance; 2) extension of the
two level segment/microsegment formalism to context-dependent modeling in word recognition
and assessment of trade-offs in mixture vs. trajectory modeling, finding that (non-tied) mixtures
are more useful for context-independent modeling and constrained trajectories are more
appropriate for context-dependent modeling; 3) investigation of the use of tied mixtures at the frame level
and at the segment level, looking at trade-offs of different methods for parameter initialization and
different regions of parameter tying, achieving a 20% reduction in word error on the RM task by
using frame-level full covariance tied mixtures, but no gains on the WSJ task; 4) development and
assessment of automatically generated multiple-pronunciation word networks (no performance
improvements obtained in experiments on the Resource Management task, but higher quality phone
alignments are obtained in other tasks); 5) implementation of optional silence insertion in both
recognition and training, which led to a slight improvement in performance on WSJ; and 6)
automatic distribution mapping estimation using a maximum likelihood criterion, which is an important
development needed for extending the segment model to different speech units and different feature
sets.
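
As a rough illustration of likelihood-based context clustering (item 1 above), the sketch below merges the two clusters whose pooling costs the least log-likelihood. It is only a sketch under strong assumptions: observations are scalar, each cluster is a single Gaussian with its own ML mean and variance, and the search is greedy and agglomerative. The report does not specify the search procedure or the form of the test statistic, and the context labels and data are invented.

```python
import math

# Hypothetical scalar observations for three contexts of one phone.
CONTEXTS = {
    "a-t": [1.0, 1.2, 0.8],
    "a-d": [1.1, 0.9, 1.0],
    "a-n": [5.0, 5.3, 4.7],
}


def log_likelihood(samples, var_floor=1e-6):
    """Gaussian log-likelihood of samples at their own ML mean and variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    var = max(var, var_floor)  # floor guards against log(0) for tiny clusters
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def merge_loss(a, b):
    """Log-likelihood lost when two clusters share one Gaussian."""
    return log_likelihood(a) + log_likelihood(b) - log_likelihood(a + b)


def cluster_contexts(contexts, num_clusters):
    """Greedily merge the cheapest pair until num_clusters remain."""
    clusters = [([name], list(data)) for name, data in contexts.items()]
    while len(clusters) > num_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: merge_loss(clusters[p[0]][1],
                                                   clusters[p[1]][1]))
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    for names, data in cluster_contexts(CONTEXTS, 2):
        print(names, len(data))
```

Contexts that end up in the same cluster share one set of parameters, which is where the memory savings in such a scheme would come from.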

CIR framework. One approach to segment-based modeling is to do "classification in
recognition" (CIR), or classification of a variable-length segment using a posterior distribution based
on fixed-length features, a useful formalism because it opens the possibility for a broader class of
features for recognition. In the past, we have shown that this approach requires both classification
and segmentation scoring to be effective. In this project, we made an important step forward in
building a formalism for using posterior distributions in classification through our development of a
mechanism to handle context-dependent models without requiring the assumption of independence
of features spanning different phone segments. The context-dependent model was derived using
a maximum entropy criterion in estimating a combined function of posterior probability terms.
This formalism will allow the use of acoustic measurements over a longer time span and facilitate
hierarchical modeling. Through mathematical analysis as well as experiments in context-dependent
modeling, we uncovered fundamental problems in reported implementations of context-dependent
CIR scoring that require changes to the classification score.

SSM baseline results. We have ported the baseline SSM to the Resource Management,
Switchboard and Wall Street Journal tasks, and demonstrated speaker-independent recognition
results comparable to the best HMM systems. On the Resource Management (RM) task we report
3.6% word error on the October 1989 Resource Management test for the SSM alone, and 3.1%
word error for the combined SSM-HMM system. (The best reported HMM result on this test
set is LIMSI's 3.2% error rate.) On the September 1992 test set for this task, our performance
figures are 7.3% and 6.1% word error for these two systems, which are also very good results given
the difficulty of the test set. We ported our recognition system to the Switchboard credit card
task, as part of our participation at the Robust Speech Recognition Workshop at Rutgers this
past summer. Our result of 29% accuracy for gender-independent models was comparable to all
HMM systems reporting on this task, excluding the 32-33% accuracy achieved by systems using
gender-dependent models. Our baseline results on the Wall Street Journal (WSJ) 5k vocabulary
task represent improved performance over all results reported in November 1992. For the
non-verbalized punctuation test set and the bigram language model, we achieve 8.1% error with the
SSM and 7.3% error with the combined HMM-SSM system, which can be compared to reported
rates of 8.7% - 15% for comparable HMM systems. Interestingly, our best results on the WSJ
task are based on full-covariance, single-mode Gaussians, while the best results on the RM task
are achieved with tied-mixture models. The RM results of the context-clustering algorithm were
confirmed on the WSJ task.

Dependency tree model. An important goal of this project is the development of a
hierarchical model of intra-utterance correlation of phone observations. Our initial efforts in this area
have been to extend prior work on finding the minimal spanning dependence trees, from the
discrete models of Chow and Liu to Gauss-Markov models of dependence. The initial implementation
favored connections between infrequently observed classes, so we are currently investigating robust
algorithms for designing trees, as well as the use of discrete distribution dependence trees in
mixture models. In order to quickly assess different models of dependence without the high cost of
building a full word recognition system, we are initially comparing prediction errors for different
models within the context of the TIMIT corpus.
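
A minimal sketch of the tree construction follows, assuming scalar, jointly Gaussian observations. For such a pair the mutual information is -0.5 ln(1 - rho^2), where rho is the correlation coefficient, and a Chow-Liu style tree is the spanning tree that maximizes the total pairwise mutual information. The variable names and data are invented, and the function names are hypothetical; the sketch does not address the robustness problem with infrequently observed classes noted above.

```python
import math

# Hypothetical observation sequences for three variables.
DATA = {
    "a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [0.1, 0.9, 2.2, 2.8, 4.1, 5.0],
    "c": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}


def correlation(x, y):
    """Sample correlation coefficient (assumes neither sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / math.sqrt(vx * vy)


def gaussian_mutual_information(x, y):
    """Mutual information of two jointly Gaussian scalars, in nats."""
    rho2 = min(correlation(x, y) ** 2, 1.0 - 1e-12)  # keep the log finite
    return -0.5 * math.log(1.0 - rho2)


def dependence_tree(data):
    """Maximum-weight spanning tree over pairwise mutual information (Kruskal)."""
    names = sorted(data)
    edges = sorted(((gaussian_mutual_information(data[u], data[v]), u, v)
                    for i, u in enumerate(names) for v in names[i + 1:]),
                   reverse=True)
    parent = {name: name for name in names}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    for mi, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding the edge does not create a cycle
            parent[root_u] = root_v
            tree.append((u, v, mi))
    return tree


if __name__ == "__main__":
    for u, v, mi in dependence_tree(DATA):
        print(u, v, round(mi, 3))
```

Because the mutual information is estimated from sample correlations, pairs with few joint observations can receive spuriously large weights, which is one plausible reading of the bias toward infrequently observed classes.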

Language models. Motivated by the realization that inter- and intra-utterance correlation can
be modeled at the language as well as acoustic level, we have begun an effort in dynamic language
modeling. As a first step in this project, we have implemented the Katz and Witten-Bell back-off
algorithms for estimating n-gram language models, and are currently evaluating the impact of these
differences on recognition performance. We developed a formalism for modeling the probability of
the different alternatives people have for verbalizing numbers and punctuation, recognizing that
in spontaneous dictation, some types of punctuation are more likely to be verbalized than others.
Finally, we developed a mixture language model formalism that represents the topic-dependent
structure of language at the utterance level.
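
One way to read an utterance-level mixture is that the whole utterance is scored by each topic model and the topic scores are then combined, P(w_1..w_n) = sum_k lambda_k P_k(w_1..w_n). The sketch below illustrates that reading only. To stay short it uses unigram topic components, whereas n-gram components would be the natural choice in practice; the topics, vocabulary, probabilities and mixture weights are all invented.

```python
import math

# Hypothetical topic-dependent unigram models and mixture weights.
COMPONENTS = [
    {"stocks": 0.5, "rose": 0.3, "rain": 0.2},  # "finance" topic
    {"stocks": 0.1, "rose": 0.2, "rain": 0.7},  # "weather" topic
]
WEIGHTS = [0.5, 0.5]


def sentence_log_prob(words, component):
    """Log probability of the whole utterance under one topic model."""
    return sum(math.log(component[w]) for w in words)


def topic_log_terms(words, components, weights):
    """log(lambda_k * P_k(words)) for every topic k."""
    return [math.log(lam) + sentence_log_prob(words, comp)
            for lam, comp in zip(weights, components)]


def mixture_log_prob(words, components, weights):
    """log sum_k lambda_k * P_k(words), computed with log-sum-exp."""
    terms = topic_log_terms(words, components, weights)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def topic_posteriors(words, components, weights):
    """Posterior probability of each topic given the utterance."""
    terms = topic_log_terms(words, components, weights)
    total = mixture_log_prob(words, components, weights)
    return [math.exp(t - total) for t in terms]


if __name__ == "__main__":
    utterance = ["stocks", "rose"]
    print(math.exp(mixture_log_prob(utterance, COMPONENTS, WEIGHTS)))
    print(topic_posteriors(utterance, COMPONENTS, WEIGHTS))
```

Mixing at the utterance level differs from interpolating the component probabilities word by word: here every word in the utterance is scored by the same topic, so the model can express the assumption that a topic persists across the utterance.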


Principal Investigator Name: Mari Ostendorf
PI Institution: Boston University
PI Phone Number: 617-353-5130
PI E-mail Address: mo@raven.bu.edu

Grant or Contract Title: Segment-Based Acoustic Models for Continuous Speech Recognition
Grant or Contract Number: ONR-N00014-92-J-1778
Reporting Period: 1 April 1993 - 30 June 1993

3 Publications and Presentations

Papers associated with this work and written during the reporting period include a site report, two
conference papers, and a correspondence paper that was submitted and accepted for publication
during the reporting period. A journal paper documenting our prior work in recognition also
appeared during this period.

• "Fast Search Algorithms for Phone Classification and Recognition Using Segment-Based
Models," V. Digalakis, M. Ostendorf and J. R. Rohlicek, IEEE Transactions on Signal Processing,
December 1992, pp. 2885-2896.

• "Segment-Based Acoustic Models for Continuous Speech Recognition," M. Ostendorf and J.
R. Rohlicek, site report to appear in Proceedings of the ARPA Workshop on Human Language
Technology, 1993.

• "On the Use of Tied-Mixture Distributions," O. Kimball and M. Ostendorf, to appear in
Proceedings of the ARPA Workshop on Human Language Technology, 1993.

• "A Comparison of Trajectory and Mixture Modeling in Segment-based Word Recognition,"
A. Kannan and M. Ostendorf, Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, pp. II327-330, April 1993.

• "Maximum Likelihood Clustering of Gaussians for Speech Recognition," A. Kannan, M.
Ostendorf and J. R. Rohlicek, IEEE Transactions on Speech and Audio Processing, to appear.

4 Transitions and DoD Interactions

This grant includes a subcontract to BBN, and the research results and software are available to
them. Thus far, we have collaborated with BBN by combining the Byblos system with the SSM
in N-Best sentence rescoring to obtain improved recognition performance, and we have provided
BBN with papers and technical reports to facilitate sharing of algorithmic improvements. On their
part, BBN has been very helpful to us in our WSJ porting efforts, providing us with WSJ data and
consulting on format changes.

The recognition system that has been developed under the support of this grant and of a
joint NSF-ARPA grant (NSF no. IRI-8902124) is currently being used for automatically obtaining
good quality phonetic alignments for a corpus of radio news speech under development at Boston
University. The alignment effort is supported by the Linguistic Data Consortium, through a grant
that allowed us to add cross-word phonological rules to the segmentation software.

5 Software and Hardware Prototypes

Our research has required the development and refinement of software systems for parameter
estimation and recognition search, which are implemented in C or C++ and run on Sun Sparc
workstations. No commercialization is planned at this time.


6 Vugraphs

Attached is a quad chart that illustrates the acoustic modeling philosophy of this project and lists
the goals and key accomplishments. This chart was provided to ARPA earlier this year.